Overview

The term text mining (also known as text analysis) describes a process in which relevant or interesting information is extracted from a corpus of documents. Text is usually unstructured, so even the most basic analysis requires some restructuring. Thus, an important question to ask before starting any text analysis project is “To what format should I convert my text data to best support the analyses I want to perform?” In R, there are two primary formats to choose from: the tidy data format using the tidytext package and the non-tidy data formats supported by several packages including tm, text2vec, quanteda, and RWeka.

The tidytext package allows us to analyze text using the tidy text format: a table with one-token-per-document-per-row. This format allows us to pipe our analysis directly into the popular suite of tidyverse tools such as dplyr, tidyr, and ggplot2 to explore and visualize the data. However, most text analysis/NLP tools in R do not use a tidy text format. The CRAN Task View for Natural Language Processing lists a large selection of packages that take other structures of input and provide non-tidy outputs. These packages are very useful in text mining applications, and many existing text datasets are structured according to these formats. Thus, it's important to understand how to convert back and forth between the different formats.
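As a minimal sketch of that back-and-forth conversion, the example below builds a tiny tidy table of word counts, casts it to a tm DocumentTermMatrix with tidytext's cast_dtm(), and then tidies it back with tidy(). The toy documents and the object names (toy, counts, dtm, back) are made up for illustration, and the sketch assumes the dplyr, tidytext, and tm packages are installed.

```r
library(dplyr)
library(tidytext)

# Toy corpus (hypothetical data): two short "documents"
toy <- tibble::tibble(doc  = c("a", "b"),
                      text = c("the owl flew", "the cat sat"))

# Tidy format: one row per document-word pair, with a count
counts <- toy %>%
        unnest_tokens(word, text) %>%
        count(doc, word)

# Tidy -> non-tidy: cast to a DocumentTermMatrix (needs the tm package)
dtm <- counts %>%
        cast_dtm(document = doc, term = word, value = n)

# Non-tidy -> tidy: back to one row per non-zero document-term entry
back <- tidy(dtm)
back
```

Other packages have their own target structures; this only illustrates the general round trip for one of them.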

This document presents several methods to analyze text data and leverages the data provided in the harrypotter package created by Brad Boehmke. This package has not been published to CRAN, but it can be installed from GitHub by running the code below:

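# install devtools if it is missing or older than version 1.6
if (!requireNamespace("devtools", quietly = TRUE) ||
    packageVersion("devtools") < "1.6") {
  install.packages("devtools")
}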

devtools::install_github("bradleyboehmke/harrypotter")

This package provides the text from the following novels in the Harry Potter series, which we use to illustrate various text mining and analysis capabilities:

  • philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
  • chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
  • prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
  • goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
  • order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
  • half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
  • deathly_hallows: Harry Potter and the Deathly Hallows (2007)

The text from each book is stored as a character vector with each element representing a single chapter. For instance, the following prints the raw text of the first chapter of philosophers_stone:

harrypotter::philosophers_stone[1]
## [1] "THE BOY WHO LIVED  Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.  Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.  The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn't met for several years; in fact, Mrs. Dursley pretended she didn't have a sister, because her sister and her good-for-nothing husband were as unDursleyish as it was possible to be. The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street. The Dursleys knew that the Potters had a small son, too, but they had never even seen him. This boy was another good reason for keeping the Potters away; they didn't want Dudley mixing with a child like that.  When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story starts, there was nothing about the cloudy sky outside to suggest that strange and mysterious things would soon be happening all over the country. Mr. Dursley hummed as he picked out his most boring tie for work, and Mrs. Dursley gossiped away happily as she wrestled a screaming Dudley into his high chair.  None of them noticed a large, tawny owl flutter past the window.  At half past eight, Mr. 
Dursley picked up his briefcase, pecked Mrs. Dursley on the cheek, and tried to kiss Dudley good-bye but missed, because Dudley was now having a tantrum and throwing his cereal at the walls. \"Little tyke,\" chortled Mr. Dursley as he left the house. He got into his car and backed out of number four's drive.  It was on the corner of the street that he noticed the first sign of something peculiar -- a cat reading a map. For a second, Mr. Dursley didn't realize what he had seen -- then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn't a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat. It stared back. As Mr. Dursley drove around the corner and up the road, he watched the cat in his mirror. It was now reading the sign that said Privet Drive -- no, looking at the sign; cats couldn't read maps or signs. Mr. Dursley gave himself a little shake and put the cat out of his mind. As he drove toward town he thought of nothing except a large order of drills he was hoping to get that day.  But on the edge of town, drills were driven out of his mind by something else. As he sat in the usual morning traffic jam, he couldn't help noticing that there seemed to be a lot of strangely dressed people about. People in cloaks. Mr. Dursley couldn't bear people who dressed in funny clothes -- the getups you saw on young people! He supposed this was some stupid new fashion. He drummed his fingers on the steering wheel and his eyes fell on a huddle of these weirdos standing quite close by. They were whispering excitedly together. Mr. Dursley was enraged to see that a couple of them weren't young at all; why, that man had to be older than he was, and wearing an emerald-green cloak! The nerve of him! But then it struck Mr. Dursley that this was probably some silly stunt -- these people were obviously collecting for something... 
yes, that would be it. The traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.  Mr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn't, he might have found it harder to concentrate on drills that morning. He didn't see the owls swoop ing past in broad daylight, though people down in the street did; they pointed and gazed open- mouthed as owl after owl sped overhead. Most of them had never seen an owl even at nighttime. Mr. Dursley, however, had a perfectly normal, owl-free morning. He yelled at five different people. He made several important telephone calls and shouted a bit more. He was in a very good mood until lunchtime, when he thought he'd stretch his legs and walk across the road to buy himself a bun from the bakery.  He'd forgotten all about the people in cloaks until he passed a group of them next to the baker's. He eyed them angrily as he passed. He didn't know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn't see a single collecting tin. It was on his way back past them, clutching a large doughnut in a bag, that he caught a few words of what they were saying.   \"The Potters, that's right, that's what I heard yes, their son, Harry\"  Mr. Dursley stopped dead. Fear flooded him. He looked back at the whisperers as if he wanted to say something to them, but thought better of it.  He dashed back across the road, hurried up to his office, snapped at his secretary not to disturb him, seized his telephone, and had almost finished dialing his home number when he changed his mind. He put the receiver back down and stroked his mustache, thinking... no, he was being stupid. Potter wasn't such an unusual name. He was sure there were lots of people called Potter who had a son called Harry. Come to think of it, he wasn't even sure his nephew was called Harry. He'd never even seen the boy. It might have been Harvey. 
Or Harold. There was no point in worrying Mrs. Dursley; she always got so upset at any mention of her sister. He didn't blame her -- if he'd had a sister like that... but all the same, those people in cloaks...  He found it a lot harder to concentrate on drills that afternoon and when he left the building at five o'clock, he was still so worried that he walked straight into someone just outside the door.  \"Sorry,\" he grunted, as the tiny old man stumbled and almost fell. It was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak. He didn't seem at all upset at being almost knocked to the ground. On the contrary, his face split into a wide smile and he said in a squeaky voice that made passersby stare, \"Don't be sorry, my dear sir, for nothing could upset me today! Rejoice, for You-Know-Who has gone at last! Even Muggles like yourself should be celebrating, this happy, happy day!\"  And the old man hugged Mr. Dursley around the middle and walked off.  Mr. Dursley stood rooted to the spot. He had been hugged by a complete stranger. He also thought he had been called a Muggle, whatever that was. He was rattled. He hurried to his car and set off for home, hoping he was imagining things, which he had never hoped before, because he didn't approve of imagination.  As he pulled into the driveway of number four, the first thing he saw -- and it didn't improve his mood -- was the tabby cat he'd spotted that morning. It was now sitting on his garden wall. He was sure it was the same one; it had the same markings around its eyes.  \"Shoo!\" said Mr. Dursley loudly. The cat didn't move. It just gave him a stern look. Was this normal cat behavior? Mr. Dursley wondered. Trying to pull himself together, he let himself into the house. He was still determined not to mention anything to his wife.  Mrs. Dursley had had a nice, normal day. She told him over dinner all about Mrs. 
Next Door's problems with her daughter and how Dudley had learned a new word (\"Won't!\"). Mr. Dursley tried to act normally. When Dudley had been put to bed, he went into the living room in time to catch the last report on the evening news:  \"And finally, bird-watchers everywhere have reported that the nation's owls have been behaving very unusually today. Although owls normally hunt at night and are hardly ever seen in daylight, there have been hundreds of sightings of these birds flying in every direction since sunrise. Experts are unable to explain why the owls have suddenly changed their sleeping pattern.\" The newscaster allowed himself a grin. \"Most mysterious. And now, over to Jim McGuffin with the weather. Going to be any more showers of owls tonight, Jim?\"  \"Well, Ted,\" said the weatherman, \"I don't know about that, but it's not only the owls that have been acting oddly today. Viewers as far apart as Kent, Yorkshire, and Dundee have been phoning in to tell me that instead of the rain I promised yesterday, they've had a downpour of shooting stars! Perhaps people have been celebrating Bonfire Night early -- it's not until next week, folks! But I can promise a wet night tonight.\"  Mr. Dursley sat frozen in his armchair. Shooting stars all over Britain? Owls flying by daylight? Mysterious people in cloaks all over the place? And a whisper, a whisper about the Potters...  Mrs. Dursley came into the living room carrying two cups of tea. It was no good. He'd have to say something to her. He cleared his throat nervously. \"Er -- Petunia, dear -- you haven't heard from your sister lately, have you?\"  As he had expected, Mrs. Dursley looked shocked and angry. After all, they normally pretended she didn't have a sister.  \"No,\" she said sharply. \"Why?\"  \"Funny stuff on the news,\" Mr. Dursley mumbled. \"Owls... shooting stars... and there were a lot of funny-looking people in town today...\"  \"So?\" snapped Mrs. Dursley.  \"Well, I just thought... 
maybe... it was something to do with... you know... her crowd.\"  Mrs. Dursley sipped her tea through pursed lips. Mr. Dursley wondered whether he dared tell her he'd heard the name \"Potter.\" He decided he didn't dare. Instead he said, as casually as he could, \"Their son -- he'd be about Dudley's age now, wouldn't he?\"  \"I suppose so,\" said Mrs. Dursley stiffly.  \"What's his name again? Howard, isn't it?\"  \"Harry. Nasty, common name, if you ask me.\"  \"Oh, yes,\" said Mr. Dursley, his heart sinking horribly. \"Yes, I quite agree.\"  He didn't say another word on the subject as they went upstairs to bed. While Mrs. Dursley was in the bathroom, Mr. Dursley crept to the bedroom window and peered down into the front garden. The cat was still there. It was staring down Privet Drive as though it were waiting for something.  Was he imagining things? Could all this have anything to do with the Potters? If it did... if it got out that they were related to a pair of -- well, he didn't think he could bear it.  The Dursleys got into bed. Mrs. Dursley fell asleep quickly but Mr. Dursley lay awake, turning it all over in his mind. His last, comforting thought before he fell asleep was that even if the Potters were involved, there was no reason for them to come near him and Mrs. Dursley. The Potters knew very well what he and Petunia thought about them and their kind.... He couldn't see how he and Petunia could get mixed up in anything that might be going on -- he yawned and turned over -- it couldn't affect them....  How very wrong he was.  Mr. Dursley might have been drifting into an uneasy sleep, but the cat on the wall outside was showing no sign of sleepiness. It was sitting as still as a statue, its eyes fixed unblinkingly on the far corner of Privet Drive. It didn't so much as quiver when a car door slammed on the next street, nor when two owls swooped overhead. In fact, it was nearly midnight before the cat moved at all.  
A man appeared on the corner the cat had been watching, appeared so suddenly and silently you'd have thought he'd just popped out of the ground. The cat's tail twitched and its eyes narrowed.  Nothing like this man had ever been seen on Privet Drive. He was tall, thin, and very old, judging by the silver of his hair and beard, which were both long enough to tuck into his belt. He was wearing long robes, a purple cloak that swept the ground, and high-heeled, buckled boots. His blue eyes were light, bright, and sparkling behind half-moon spectacles and his nose was very long and crooked, as though it had been broken at least twice. This man's name was Albus Dumbledore.  Albus Dumbledore didn't seem to realize that he had just arrived in a street where everything from his name to his boots was unwelcome. He was busy rummaging in his cloak, looking for something. But he did seem to realize he was being watched, because he looked up suddenly at the cat, which was still staring at him from the other end of the street. For some reason, the sight of the cat seemed to amuse him. He chuckled and muttered, \"I should have known.\"  He found what he was looking for in his inside pocket. It seemed to be a silver cigarette lighter. He flicked it open, held it up in the air, and clicked it. The nearest street lamp went out with a little pop. He clicked it again -- the next lamp flickered into darkness. Twelve times he clicked the Put-Outer, until the only lights left on the whole street were two tiny pinpricks in the distance, which were the eyes of the cat watching him. If anyone looked out of their window now, even beady-eyed Mrs. Dursley, they wouldn't be able to see anything that was happening down on the pavement. Dumbledore slipped the Put-Outer back inside his cloak and set off down the street toward number four, where he sat down on the wall next to the cat. He didn't look at it, but after a moment he spoke to it.  
\"Fancy seeing you here, Professor McGonagall.\"  He turned to smile at the tabby, but it had gone. Instead he was smiling at a rather severe-looking woman who was wearing square glasses exactly the shape of the markings the cat had had around its eyes. She, too, was wearing a cloak, an emerald one. Her black hair was drawn into a tight bun. She looked distinctly ruffled.  \"How did you know it was me?\" she asked.  \"My dear Professor, I 've never seen a cat sit so stiffly.\"  \"You'd be stiff if you'd been sitting on a brick wall all day,\" said Professor McGonagall.  \"All day? When you could have been celebrating? I must have passed a dozen feasts and parties on my way here.\"  Professor McGonagall sniffed angrily.  \"Oh yes, everyone's celebrating, all right,\" she said impatiently. \"You'd think they'd be a bit more careful, but no -- even the Muggles have noticed something's going on. It was on their news.\" She jerked her head back at the Dursleys' dark living-room window. \"I heard it. Flocks of owls... shooting stars.... Well, they're not completely stupid. They were bound to notice something. Shooting stars down in Kent -- I'll bet that was Dedalus Diggle. He never had much sense.\"  \"You can't blame them,\" said Dumbledore gently. \"We've had precious little to celebrate for eleven years.\"  \"I know that,\" said Professor McGonagall irritably. \"But that's no reason to lose our heads. People are being downright careless, out on the streets in broad daylight, not even dressed in Muggle clothes, swapping rumors.\"  She threw a sharp, sideways glance at Dumbledore here, as though hoping he was going to tell her something, but he didn't, so she went on. \"A fine thing it would be if, on the very day YouKnow-Who seems to have disappeared at last, the Muggles found out about us all. I suppose he really has gone, Dumbledore?\"  \"It certainly seems so,\" said Dumbledore. \"We have much to be thankful for. 
Would you care for a lemon drop?\"  \"A what?\"  \"A lemon drop. They're a kind of Muggle sweet I'm rather fond of\"  \"No, thank you,\" said Professor McGonagall coldly, as though she didn't think this was the moment for lemon drops. \"As I say, even if You-Know-Who has gone -\"  \"My dear Professor, surely a sensible person like yourself can call him by his name? All this 'You- Know-Who' nonsense -- for eleven years I have been trying to persuade people to call him by his proper name: Voldemort.\" Professor McGonagall flinched, but Dumbledore, who was unsticking two lemon drops, seemed not to notice. \"It all gets so confusing if we keep saying 'You-Know-Who.' I have never seen any reason to be frightened of saying Voldemort's name.  \"I know you haven 't, said Professor McGonagall, sounding half exasperated, half admiring. \"But you're different. Everyone knows you're the only one You-Know- oh, all right, Voldemort, was frightened of.\"  \"You flatter me,\" said Dumbledore calmly. \"Voldemort had powers I will never have.\"  \"Only because you're too -- well -- noble to use them.\"  \"It's lucky it's dark. I haven't blushed so much since Madam Pomfrey told me she liked my new earmuffs.\"  Professor McGonagall shot a sharp look at Dumbledore and said, \"The owls are nothing next to the rumors that are flying around. You know what everyone's saying? About why he's disappeared? About what finally stopped him?\"  It seemed that Professor McGonagall had reached the point she was most anxious to discuss, the real reason she had been waiting on a cold, hard wall all day, for neither as a cat nor as a woman had she fixed Dumbledore with such a piercing stare as she did now. It was plain that whatever \"everyone\" was saying, she was not going to believe it until Dumbledore told her it was true. Dumbledore, however, was choosing another lemon drop and did not answer.  \"What they're saying,\" she pressed on, \"is that last night Voldemort turned up in Godric's Hollow. 
He went to find the Potters. The rumor is that Lily and James Potter are -- are -- that they're -- dead. \"  Dumbledore bowed his head. Professor McGonagall gasped.  \"Lily and James... I can't believe it... I didn't want to believe it... Oh, Albus...\"  Dumbledore reached out and patted her on the shoulder. \"I know... I know...\" he said heavily.  Professor McGonagall's voice trembled as she went on. \"That's not all. They're saying he tried to kill the Potter's son, Harry. But -- he couldn't. He couldn't kill that little boy. No one knows why, or how, but they're saying that when he couldn't kill Harry Potter, Voldemort's power somehow broke -- and that's why he's gone.  Dumbledore nodded glumly.  \"It's -- it's true?\" faltered Professor McGonagall. \"After all he's done... all the people he's killed... he couldn't kill a little boy? It's just astounding... of all the things to stop him... but how in the name of heaven did Harry survive?\"  \"We can only guess,\" said Dumbledore. \"We may never know.\"  Professor McGonagall pulled out a lace handkerchief and dabbed at her eyes beneath her spectacles. Dumbledore gave a great sniff as he took a golden watch from his pocket and examined it. It was a very odd watch. It had twelve hands but no numbers; instead, little planets were moving around the edge. It must have made sense to Dumbledore, though, because he put it back in his pocket and said, \"Hagrid's late. I suppose it was he who told you I'd be here, by the way?\"  \"Yes,\" said Professor McGonagall. \"And I don't suppose you're going to tell me why you're here, of all places?\"  \"I've come to bring Harry to his aunt and uncle. They're the only family he has left now.\"  \"You don't mean -- you can't mean the people who live here?\" cried Professor McGonagall, jumping to her feet and pointing at number four. \"Dumbledore -- you can't. I've been watching them all day. You couldn't find two people who are less like us. 
And they've got this son -- I saw him kicking his mother all the way up the street, screaming for sweets. Harry Potter come and live here!\"  \"It's the best place for him,\" said Dumbledore firmly. \"His aunt and uncle will be able to explain everything to him when he's older. I've written them a letter.\"  \"A letter?\" repeated Professor McGonagall faintly, sitting back down on the wall. \"Really, Dumbledore, you think you can explain all this in a letter? These people will never understand him! He'll be famous -- a legend -- I wouldn't be surprised if today was known as Harry Potter day in the future -- there will be books written about Harry -- every child in our world will know his name!\"  \"Exactly,\" said Dumbledore, looking very seriously over the top of his half-moon glasses. \"It would be enough to turn any boy's head. Famous before he can walk and talk! Famous for something he won't even remember! CarA you see how much better off he'll be, growing up away from all that until he's ready to take it?\"  Professor McGonagall opened her mouth, changed her mind, swallowed, and then said, \"Yes -- yes, you're right, of course. But how is the boy getting here, Dumbledore?\" She eyed his cloak suddenly as though she thought he might be hiding Harry underneath it.  \"Hagrid's bringing him.\"  \"You think it -- wise -- to trust Hagrid with something as important as this?\"  I would trust Hagrid with my life,\" said Dumbledore.  \"I'm not saying his heart isn't in the right place,\" said Professor McGonagall grudgingly, \"but you can't pretend he's not careless. He does tend to -- what was that?\"  A low rumbling sound had broken the silence around them. It grew steadily louder as they looked up and down the street for some sign of a headlight; it swelled to a roar as they both looked up at the sky -- and a huge motorcycle fell out of the air and landed on the road in front of them.  If the motorcycle was huge, it was nothing to the man sitting astride it. 
He was almost twice as tall as a normal man and at least five times as wide. He looked simply too big to be allowed, and so wild - long tangles of bushy black hair and beard hid most of his face, he had hands the size of trash can lids, and his feet in their leather boots were like baby dolphins. In his vast, muscular arms he was holding a bundle of blankets.  \"Hagrid,\" said Dumbledore, sounding relieved. \"At last. And where did you get that motorcycle?\"  \"Borrowed it, Professor Dumbledore, sit,\" said the giant, climbing carefully off the motorcycle as he spoke. \"Young Sirius Black lent it to me. I've got him, sir.\"  \"No problems, were there?\"  \"No, sir -- house was almost destroyed, but I got him out all right before the Muggles started swarmin' around. He fell asleep as we was flyin' over Bristol.\"  Dumbledore and Professor McGonagall bent forward over the bundle of blankets. Inside, just visible, was a baby boy, fast asleep. Under a tuft of jet-black hair over his forehead they could see a curiously shaped cut, like a bolt of lightning.  \"Is that where -?\" whispered Professor McGonagall.  \"Yes,\" said Dumbledore. \"He'll have that scar forever.\"  \"Couldn't you do something about it, Dumbledore?\"  \"Even if I could, I wouldn't. Scars can come in handy. I have one myself above my left knee that is a perfect map of the London Underground. Well -- give him here, Hagrid -- we'd better get this over with.\"  Dumbledore took Harry in his arms and turned toward the Dursleys' house.  \"Could I -- could I say good-bye to him, sir?\" asked Hagrid. He bent his great, shaggy head over Harry and gave him what must have been a very scratchy, whiskery kiss. Then, suddenly, Hagrid let out a howl like a wounded dog.  \"Shhh!\" hissed Professor McGonagall, \"you'll wake the Muggles!\"  \"S-s-sorry,\" sobbed Hagrid, taking out a large, spotted handkerchief and burying his face in it. 
\"But I c-c-can't stand it -- Lily an' James dead -- an' poor little Harry off ter live with Muggles -\"  \"Yes, yes, it's all very sad, but get a grip on yourself, Hagrid, or we'll be found,\" Professor McGonagall whispered, patting Hagrid gingerly on the arm as Dumbledore stepped over the low garden wall and walked to the front door. He laid Harry gently on the doorstep, took a letter out of his cloak, tucked it inside Harry's blankets, and then came back to the other two. For a full minute the three of them stood and looked at the little bundle; Hagrid's shoulders shook, Professor McGonagall blinked furiously, and the twinkling light that usually shone from Dumbledore's eyes seemed to have gone out.  \"Well,\" said Dumbledore finally, \"that's that. We've no business staying here. We may as well go and join the celebrations.\"  \"Yeah,\" said Hagrid in a very muffled voice, \"I'll be takin' Sirius his bike back. G'night, Professor McGonagall -- Professor Dumbledore, sir.\"  Wiping his streaming eyes on his jacket sleeve, Hagrid swung himself onto the motorcycle and kicked the engine into life; with a roar it rose into the air and off into the night.  \"I shall see you soon, I expect, Professor McGonagall,\" said Dumbledore, nodding to her. Professor McGonagall blew her nose in reply.  Dumbledore turned and walked back down the street. On the corner he stopped and took out the silver Put-Outer. He clicked it once, and twelve balls of light sped back to their street lamps so that Privet Drive glowed suddenly orange and he could make out a tabby cat slinking around the corner at the other end of the street. He could just see the bundle of blankets on the step of number four.  \"Good luck, Harry,\" he murmured. He turned on his heel and with a swish of his cloak, he was gone.  A breeze ruffled the neat hedges of Privet Drive, which lay silent and tidy under the inky sky, the very last place you would expect astonishing things to happen. 
Harry Potter rolled over inside his blankets without waking up. One small hand closed on the letter beside him and he slept on, not knowing he was special, not knowing he was famous, not knowing he would be woken in a few hours' time by Mrs. Dursley's scream as she opened the front door to put out the milk bottles, nor that he would spend the next few weeks being prodded and pinched by his cousin Dudley... He couldn't know that at this very moment, people meeting in secret all over the country were holding up their glasses and saying in hushed voices: \"To Harry Potter -- the boy who lived!"
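Printing a whole chapter is unwieldy, so it can help to inspect the structure of the vector instead. The sketch below uses a hypothetical two-element stand-in named chapters (not the real book text) so that it runs without the harrypotter package; with the package installed, the same base R calls should apply to harrypotter::philosophers_stone.

```r
# Hypothetical stand-in for a book: one character string per chapter
chapters <- c("CHAPTER ONE  An owl flew past the window.",
              "CHAPTER TWO  A letter arrived.")

n_chapters <- length(chapters)        # number of chapters
n_chars    <- nchar(chapters)         # characters in each chapter
preview    <- substr(chapters, 1, 11) # first 11 characters of each chapter
has_owl    <- grepl("owl", chapters)  # simple regex check per chapter

n_chapters
n_chars
preview
has_owl
```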

Additionally, we’ll need to install and load the following packages to help with this analysis.

if (!require("pacman")) install.packages("pacman")

pacman::p_load(tm, 
               pdftools, 
               here,
               tau,
               tidyverse,
               stringr,
               tidytext, 
               RColorBrewer,
               qdap,
               qdapRegex,
               qdapDictionaries,
               qdapTools,
               data.table,
               coreNLP,
               scales,
               harrypotter,
               text2vec,
               SnowballC,
               DT,
               quanteda,
               RWeka,
               broom,
               tokenizers,
               grid,
               knitr,
               widyr)

pacman::p_load_gh("dgrtwo/drlib",
                  "trinker/termco", 
                  "trinker/coreNLPsetup",        
                  "trinker/tagger")

Basic Text Mining and Visualization

Using the tidytext package

In this section, we analyze text using the tidy text format: a table with one-token-per-document-per-row, such as is constructed by the unnest_tokens function. This allows us to pipe our analysis directly into the popular suite of ‘tidyverse’ tools such as dplyr, tidyr, and ggplot2 to explore and visualize text data. Although we can do some simple regex analysis on the raw character vectors, to properly analyze this text using tidytext we’ll want to turn it into a data.frame or tibble. To do this on the philosophers_stone novel we could perform the following:

Tokenization

text_tb <- tibble::tibble(chapter = base::seq_along(philosophers_stone),
                          text = philosophers_stone)

text_tb

This creates a two-column tibble: the chapter number and the full text of that chapter. Keeping a whole chapter in a single cell isn’t very conducive to further analysis, so a better option is to ‘unnest’ the documents by token. A token is any subdivision of the text that is meaningful to us; thus a token could be a word (unigram), a bigram, a trigram, a line, or a sentence. We can unnest the text of philosophers_stone using each word as a token using the code below:

text_tb %>%
        tidytext::unnest_tokens(word, text, token = 'words')
      # tidytext::unnest_tokens(bigram  , text, token = 'ngrams', n = 2)
      # tidytext::unnest_tokens(sentence, text, token = 'sentences')

Now we’ve split up the entire philosophers_stone text into a tibble that provides each word in each chapter. It’s important to note that the unnest_tokens function does the following:

  • splits the text into tokens
  • strips all punctuation
  • converts each word to lowercase for easy comparability (use the to_lower = FALSE argument to turn this off)
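These three behaviours can be checked on a single toy sentence. The sketch below is illustrative only: the one-row tibble toy and the result names are made up, and it assumes dplyr and tidytext are installed. It tokenizes the same text into lowercased words, case-preserved words, and bigrams.

```r
library(dplyr)
library(tidytext)

# Toy input (hypothetical): one "chapter" with punctuation and mixed case
toy <- tibble::tibble(chapter = 1L,
                      text    = "Mr. Dursley hummed, LOUDLY!")

# Default: split into words, punctuation stripped, everything lowercased
words_lower <- toy %>%
        unnest_tokens(word, text)

# Same tokens, but keep the original case
words_cased <- toy %>%
        unnest_tokens(word, text, to_lower = FALSE)

# A different token type: overlapping pairs of consecutive words
bigrams <- toy %>%
        unnest_tokens(bigram, text, token = "ngrams", n = 2)

words_lower
words_cased
bigrams
```

Note that the chapter column is carried along in each result, which is what later lets us count words by chapter.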

However, what if we want to analyze text across all seven novels? To do this we can perform the same steps by looping through each novel and then combining them.

titles <- c("Philosopher's Stone", 
            "Chamber of Secrets", 
            "Prisoner of Azkaban",
            "Goblet of Fire", 
            "Order of the Phoenix", 
            "Half-Blood Prince",
            "Deathly Hallows")

books <- list(philosophers_stone, 
              chamber_of_secrets, 
              prisoner_of_azkaban,
              goblet_of_fire, 
              order_of_the_phoenix, 
              half_blood_prince,
              deathly_hallows)
  
hp_tidy <- tibble::tibble()

for(i in seq_along(titles)) {
        
        clean <- tibble::tibble(chapter = base::seq_along(books[[i]]),
                                text = books[[i]]) %>%
             tidytext::unnest_tokens(word, text) %>%
             dplyr::mutate(book = titles[i]) %>%
             dplyr::select(book, dplyr::everything())

        hp_tidy <- base::rbind(hp_tidy, clean)
}

# store book as a factor; the levels are reversed here, and the plots
# below re-level with `titles` to show the books in order of publication
hp_tidy$book <- base::factor(hp_tidy$book, levels = base::rev(titles))

hp_tidy
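
Growing a tibble with rbind inside a loop copies the accumulated rows on every iteration. One possible alternative is to build one tibble per book, stack them once, and tokenize afterwards. The sketch below uses hypothetical toy data (`toy_titles`, `toy_books`) in place of `titles` and `books`, and assumes the dplyr, tibble and tidytext packages are installed:

```r
library(dplyr)
library(tibble)
library(tidytext)

# hypothetical toy data standing in for `titles` and `books`
toy_titles <- c("Book A", "Book B")
toy_books  <- list(c("Chapter one text.", "Chapter two text."),
                   c("Only chapter."))

# build one tibble per book, stack them once, then tokenize
toy_tidy <- Map(function(title, chapters) {
                  tibble(book = title,
                         chapter = seq_along(chapters),
                         text = chapters)
                }, toy_titles, toy_books) %>%
        bind_rows() %>%
        unnest_tokens(word, text)

toy_tidy
```

The result has the same book, chapter and word columns as the loop-based version.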

We now have a tidy tibble with every individual word by chapter and by book, and can begin performing some simple analyses.

Word Frequency with tidytext

The simplest word frequency analysis is assessing the most common words in text. We can use count to assess the most common words across all the text in the Harry Potter series.

hp_tidy %>%
        dplyr::count(word, sort = TRUE)

One thing you will notice is that a lot of the most common words are not very informative (e.g. the, and, to, of, a, he, …). These are considered stop words. Most of the time we want our text mining to identify words that provide context (e.g. harry, dumbledore, granger, afraid, etc.). Thus, we can remove the stop words from our tibble with anti_join and the built-in stop_words data set provided by tidytext. Now we start to see characters and other nouns, verbs, and adjectives that we would expect to be common in this series.

hp_tidy %>%
        dplyr::anti_join(stop_words) %>%
        dplyr::count(word, sort = TRUE)
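
To make the effect of anti_join concrete, here is a minimal sketch on a handful of made-up tokens (the `toy_tokens` tibble is hypothetical), assuming dplyr and tidytext are installed. Only rows whose word does not appear in stop_words survive the join:

```r
library(dplyr)
library(tidytext)

# hypothetical tokens: three stop words and two character names
toy_tokens <- tibble(word = c("the", "harry", "and", "dumbledore", "of", "harry"))

toy_counts <- toy_tokens %>%
        anti_join(stop_words, by = "word") %>%  # drop rows matching a stop word
        count(word, sort = TRUE)

toy_counts
```

Passing `by = "word"` explicitly also avoids the message dplyr prints when it has to guess the join column.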

We can perform this same assessment but grouped by book or even each chapter within each book.

# top 10 most common words in each book
hp_tidy %>%
        dplyr::anti_join(stop_words) %>%
        dplyr::group_by(book) %>%
        dplyr::count(word, sort = TRUE) %>%
        dplyr::top_n(10)

We can visualize this with the ggplot2 package.

# top 10 most common words in each book
hp_tidy %>%
        anti_join(stop_words) %>%
        group_by(book) %>%
        count(word, sort = TRUE) %>%
        top_n(10) %>%
        ungroup() %>%
        mutate(book = base::factor(book, levels = titles),
               text_order = base::nrow(.):1) %>%
        # pipe output directly to ggplot
        ggplot(aes(reorder(word, text_order), n, fill = book)) +
          geom_bar(stat = "identity") +
          facet_wrap(~ book, scales = "free_y") +
          labs(x = NULL, y = "Frequency") +
          coord_flip() +
          theme(legend.position="none")

Now, let’s calculate the frequency for each word across the entire Harry Potter series versus within each book. This will allow us to compare strong deviations of word frequency within each book as compared to across the entire series.

# calculate percent of word use across all novels
potter_pct <- hp_tidy %>%
        dplyr::anti_join(stop_words) %>%
        dplyr::count(word) %>%
        dplyr::transmute(word, all_words = n / sum(n))

# calculate percent of word use within each novel
frequency <- hp_tidy %>%
        dplyr::anti_join(stop_words) %>%
        dplyr::count(book, word) %>%
        # group by book so that sum(n) is each novel's total, not the series total
        dplyr::group_by(book) %>%
        dplyr::mutate(book_words = n / sum(n)) %>%
        dplyr::left_join(potter_pct) %>%
        dplyr::arrange(dplyr::desc(book_words)) %>%
        dplyr::ungroup()
        
frequency
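
A quick sanity check for within-book proportions is that they sum to 1 inside each book. In current versions of dplyr, count() returns an ungrouped tibble when its input is ungrouped, so the data need to be grouped by book before dividing by sum(n). The sketch below illustrates this on hypothetical counts (`toy_counts`), assuming dplyr is installed:

```r
library(dplyr)

# hypothetical word counts for two books
toy_counts <- tibble(book = c("A", "A", "B", "B"),
                     word = c("harry", "quirrell", "harry", "dobby"),
                     n    = c(30, 10, 45, 5))

# proportion of each word within its own book
toy_freq <- toy_counts %>%
        group_by(book) %>%
        mutate(book_words = n / sum(n)) %>%
        ungroup()

# each book's proportions should sum to 1
toy_check <- toy_freq %>%
        group_by(book) %>%
        summarize(total = sum(book_words))

toy_check
```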

We can visualize this again with ggplot2 as shown below.

ggplot(frequency, 
       aes(x = book_words, 
           y = all_words, 
           color = abs(all_words - book_words))) +
        geom_abline(color = "gray40", lty = 2) +
        geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
        geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
        scale_x_log10(labels = scales::percent_format()) +
        scale_y_log10(labels = scales::percent_format()) +
        scale_color_gradient(limits = c(0, 0.001), 
                             low = "darkslategray4", 
                             high = "gray75") +
        facet_wrap(~ book, ncol = 2) +
        theme(legend.position="none") +
        labs(y = "Harry Potter Series", x = NULL)

Words that are close to the line in these plots have similar frequencies across all the novels. For example, words such as “harry”, “ron”, “dumbledore” are fairly common and used with similar frequencies across most of the books. Words that are far from the line are words that are found more in one set of texts than another. Furthermore, words standing out above the line are common across the series but not within that book; whereas words below the line are common in that particular book but not across the series. For example, “cedric” stands out above the line in the Half-Blood Prince. This means that “cedric” is fairly common across the entire Harry Potter series but is not used as much in Half-Blood Prince. In contrast, a word below the line such as “quirrell” in the Philosopher’s Stone suggests this word is common in this novel but far less common across the series.

Let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between the entire series and each book?

frequency %>%
        dplyr::group_by(book) %>%
        dplyr::summarize(correlation = stats::cor(book_words, all_words),
                         p_value = stats::cor.test(book_words,
                                                   all_words)$p.value)

The high correlations, which are all statistically significant (p-values < 0.0001), suggest that the word frequencies are highly similar across the entire Harry Potter series.

Using the tm Package

Text analysis requires working with a variety of tools, many of which have inputs and outputs that aren’t in a tidy form. This section borrows from Chapter 5 of Text Mining with R to show how to convert between a tidy text data frame and sparse document-term matrices, as well as how to tidy a Corpus object containing document metadata.

One of the most common structures that text mining packages work with is the document-term matrix (or DTM). This is a matrix where:

  • each row represents one document (such as a book or article),
  • each column represents one term, and
  • each value (typically) contains the number of appearances of that term in that document.

Since most pairings of document and term do not occur (they have the value zero), DTMs are usually implemented as sparse matrices. These objects can be treated as though they were matrices (for example, accessing particular rows and columns), but are stored in a more efficient format. We’ll discuss several implementations of these matrices in this tutorial.
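
As a small illustration of this layout, the sketch below builds a sparse document-term matrix from three hypothetical document-term counts using cast_sparse from tidytext. It assumes the tibble, tidytext and Matrix packages are installed; the names `toy_counts` and `toy_dtm` are made up for the example:

```r
library(tibble)
library(tidytext)

# hypothetical counts: "wand" never occurs in doc2
toy_counts <- tibble(document = c("doc1", "doc1", "doc2"),
                     term     = c("owl", "wand", "owl"),
                     count    = c(2, 1, 5))

# rows are documents, columns are terms; missing pairs are implicit zeros
toy_dtm <- cast_sparse(toy_counts, document, term, count)

dim(toy_dtm)

# dense view, for inspection only
toy_dense <- as.matrix(toy_dtm)
toy_dense
```

Only the three non-zero entries are stored; the doc2/wand cell appears as a zero when the matrix is converted to dense form.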

Perhaps the most widely used implementation of DTMs in R is the DocumentTermMatrix object class in the tm package. Many available text mining datasets are provided in this format. Here, we convert the seven books into a DocumentTermMatrix:

hp_dtm <- tm::VectorSource(books) %>%
  tm::VCorpus() %>%
  tm::DocumentTermMatrix(control = base::list(removePunctuation = TRUE,
                                              removeNumbers = TRUE,
                                              # stop words must be a character vector
                                              stopwords = tidytext::stop_words$word,
                                              tokenize = 'MC',
                                              # weight by normalized tf-idf
                                              weighting = function(x)
                                                tm::weightTfIdf(x, normalize = TRUE)))

tm::inspect(hp_dtm)
## <<DocumentTermMatrix (documents: 7, terms: 22035)>>
## Non-/sparse entries: 44631/109614
## Sparsity           : 71%
## Maximal term length: 22
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##     Terms
## Docs       bagman        dobby     griphook     kreacher     lockhart
##    1 0.000000e+00 0.0000000000 0.0003440724 0.0000000000 0.000000e+00
##    2 0.000000e+00 0.0010608744 0.0000000000 0.0000000000 2.716295e-03
##    3 0.000000e+00 0.0000000000 0.0000000000 0.0000000000 0.000000e+00
##    4 2.630441e-03 0.0005298709 0.0000000000 0.0000000000 1.546094e-05
##    5 1.716721e-05 0.0001406306 0.0000000000 0.0006676282 4.984643e-05
##    6 0.000000e+00 0.0001326675 0.0000000000 0.0005099126 5.806596e-06
##    7 0.000000e+00 0.0001382084 0.0014206918 0.0010138371 0.000000e+00
##     Terms
## Docs         luna        lupin    slughorn     umbridge        winky
##    1 0.0000000000 0.000000e+00 0.000000000 0.0000000000 0.000000e+00
##    2 0.0000000000 0.000000e+00 0.000000000 0.0000000000 0.000000e+00
##    3 0.0000000000 2.277094e-03 0.000000000 0.0000000000 0.000000e+00
##    4 0.0000000000 2.478928e-05 0.000000000 0.0000000000 1.799775e-03
##    5 0.0008301812 3.780888e-04 0.000000000 0.0033149192 3.433441e-05
##    6 0.0005187042 2.164575e-04 0.005121495 0.0001054992 0.000000e+00
##    7 0.0010592327 5.017564e-04 0.000123052 0.0003404677 0.000000e+00

We see that this DTM-class object contains 7 documents along the rows and terms (distinct words) along the columns. Notice that this DTM is 71% sparse (71% of document-word pairs are zero). We can access the terms in the matrix with the Terms() function.

terms <- tm::Terms(hp_dtm)
utils::head(terms, 50)
##  [1] "aaaaaaaaaargh"   "aaaaaaaaargh"    "aaaaaaaargh"    
##  [4] "aaaaaaaarrrrrgh" "aaaaaaand"       "aaaaaaarrrgh"   
##  [7] "aaaaaah"         "aaaaaand"        "aaaaah"         
## [10] "aaaaahed"        "aaaaargh"        "aaaah"          
## [13] "aaaargh"         "aaah"            "aaargh"         
## [16] "aaarrgghh"       "aah"             "aargh"          
## [19] "aback"           "abandon"         "abandoned"      
## [22] "abandoning"      "abandonment"     "abashed"        
## [25] "abate"           "abated"          "abbot"          
## [28] "abbott"          "abbotts"         "abercrombie"    
## [31] "aberdeen"        "aberforth"       "abergavenny"    
## [34] "abetted"         "abide"           "abided"         
## [37] "abiding"         "abilities"       "ability"        
## [40] "abject"          "ablaze"          "able"           
## [43] "abnormal"        "abnormality"     "abnormally"     
## [46] "aboard"          "abomination"     "abou"           
## [49] "aboui"           "abound"

DTM objects cannot be used directly with tidy tools, just as tidy data frames cannot be used as input for most text mining packages. Thus, the tidytext package provides two functions that convert between the two formats. If we wanted to analyze this data with tidy tools, we would first need to turn it into a data frame with one-token-per-document-per-row. The broom package introduced the tidy() function, which takes a non-tidy object and turns it into a tidy data frame. The tidytext package implements this method for DocumentTermMatrix objects.

(hp_tidy_tm <- tidytext::tidy(hp_dtm))

Notice that we now have a tidy three-column data frame, with variables document, term, and count. This form is convenient for analysis with the dplyr, tidytext and ggplot2 packages. Note also that the tidytext package contains several other methods for tidying objects of other classes as shown below.

tt_funcs <- base::ls(base::getNamespace("tidytext"), 
                     all.names = TRUE)

base::grep(pattern = '^tidy.', tt_funcs, value = TRUE)
##  [1] "tidy.corpus"                "tidy.Corpus"               
##  [3] "tidy.CTM"                   "tidy.dfm"                  
##  [5] "tidy.dfmSparse"             "tidy.dictionary2"          
##  [7] "tidy.DocumentTermMatrix"    "tidy.estimateEffect"       
##  [9] "tidy.jobjRef"               "tidy.LDA"                  
## [11] "tidy.simple_triplet_matrix" "tidy.STM"                  
## [13] "tidy.TermDocumentMatrix"    "tidy_topicmodels"          
## [15] "tidy_triplet"

Likewise, tidytext provides a family of cast_ functions for converting a tidy text object to an object of another class that may be useful with other packages. This casting process allows reading, filtering, and processing to be done using dplyr and other tidyverse tools, after which the data can be converted into an object that can be used by other machine learning tools. Examples of these cast functions are shown below.

# cast tidy data to a DFM object 
# for use with the quanteda package
hp_tidy_tm %>%
  tidytext::cast_dfm(document, term, count)
## Document-feature matrix of: 7 documents, 19,291 features (66.9% sparse).

# cast tidy data to a DocumentTermMatrix 
# object for use with the `tm` package
hp_tidy_tm %>%
  tidytext::cast_dtm(document, term, count)
## <<DocumentTermMatrix (documents: 7, terms: 19291)>>
## Non-/sparse entries: 44631/90406
## Sparsity           : 67%
## Maximal term length: 22
## Weighting          : term frequency (tf)

# cast tidy data to a TermDocumentMatrix 
# object for use with the `tm` package
hp_tidy_tm %>%
  tidytext::cast_tdm(term, document, count)
## <<TermDocumentMatrix (terms: 19291, documents: 7)>>
## Non-/sparse entries: 44631/90406
## Sparsity           : 67%
## Maximal term length: 22
## Weighting          : term frequency (tf)

# cast tidy data to a sparse matrix (terms in rows, documents in columns)
# uses the Matrix package
hp_tidy_tm %>%
  tidytext::cast_sparse(term, document, count) %>%
  dim
## [1] 19291     7

Using the text2vec Package

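# collapse each book's chapters into a single lowercase string, then tokenize
t2v_tokens = books %>%
             base::sapply(base::paste, collapse = " ") %>%
             base::tolower() %>%
             tokenizers::tokenize_words()

# iterator over the tokenized books
t2v_itoken = text2vec::itoken(t2v_tokens, 
                              progressbar = FALSE)

# vocabulary with stop words removed (not used by the hashing vectorizer below)
(t2v_vocab = text2vec::create_vocabulary(t2v_itoken,
                                         stopwords = tidytext::stop_words[[1]]))

# document-term matrix via feature hashing, then tf-idf weighting
t2v_dtm = text2vec::create_dtm(t2v_itoken, text2vec::hash_vectorizer())
model_tfidf = text2vec::TfIdf$new()
dtm_tfidf = model_tfidf$fit_transform(t2v_dtm)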

Text Mining - Term vs. Document Frequency

Working with tf-idf

So far we’ve identified the frequency of individual terms within a document. However, it’s also important to understand the importance that each word provides within an individual document and across a corpus of documents. In the previous section we computed a crude measure of term frequency \((tf_{t,d})\) that identifies how frequently term \(t\) occurs in document \(d\). Formally, the term frequency measure is expressed as

\[ tf_{t,d} = \begin{cases} 1 + \log_{10}\big(\mbox{count}(t,d)\big) & \mbox{if } \mbox{count}(t,d) > 0 \\ 0 & \mbox{otherwise} \end{cases} \]

Another approach is to use what is called a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. The idf is defined as:

\[ \mbox{idf}(t,D) = \log\left(\frac{N}{n_{t}}\right) \]

where the idf of a given term (t) in a set of documents (D) is a function of the total number of documents being assessed (N) and the number of documents where the term t appears (\(n_t\)).

In addition, we can combine tf and idf statistics into a single tf-idf statistic, which computes the frequency of a term adjusted for how rarely it is used. Since the ratio inside the idf’s log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0. tf-idf is defined as

\[ \mbox{tf-idf}(t,d,D) = \mbox{tf}(t,d) \cdot \mbox{idf}(t,D) \]

where tf-idf for a particular term (t) in document (d) for a set of documents (D) is simply the product of that term’s tf and idf statistics. This tutorial will walk you through the process of computing these values so that you can identify high frequency words that provide particularly important context to a single document within a group of documents.
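
The three formulas above can be worked through by hand in base R. This is a minimal sketch on hypothetical counts (the `counts` matrix is made up for illustration), and it assumes the natural logarithm for idf since the formula does not name a base:

```r
# hypothetical counts: rows are documents, columns are terms
counts <- matrix(c(100, 10,  0,
                   100,  0, 10,
                   100,  0,  1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("book1", "book2", "book3"),
                                 c("harry", "quirrell", "dobby")))

# log-scaled term frequency: 1 + log10(count) if count > 0, else 0
tf <- ifelse(counts > 0, 1 + log10(counts), 0)

# inverse document frequency: log(N / n_t), natural log assumed
N   <- nrow(counts)           # number of documents
n_t <- colSums(counts > 0)    # number of documents containing each term
idf <- log(N / n_t)

# tf-idf: scale each term's column of tf by that term's idf
tf_idf <- sweep(tf, 2, idf, `*`)
round(tf_idf, 3)
```

Because "harry" occurs in every document, its idf is log(3/3) = 0 and its tf-idf is 0 everywhere, however often it appears. "quirrell" occurs in a single document, so it receives the largest idf.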

Term Frequencies

To compute term frequencies we need to have our data in a tidy format. The hp_tidy tibble built in the tidy text section above already has each word by chapter by book for all seven Harry Potter novels, so we reuse it here.

From this cleaned up text we can compute the term frequency for each word. Let’s compute the count of each word by book, along with the total number of words in each book:

book_words <- hp_tidy %>%
        count(book, word, sort = TRUE) %>%
        dplyr::anti_join(stop_words) %>%
        ungroup()

series_words <- book_words %>%
        group_by(book) %>%
        summarise(total = sum(n))

book_words <- left_join(book_words, series_words)

book_words

Here we’ll look at the distribution of n/total, the share of each book’s words (after stop word removal) accounted for by each term. Since the distribution is so clustered around 0, we add scale_x_log10() to spread it out. Even so, we see the long right tails for those extremely common words.

book_words %>%
        mutate(ratio = n / total) %>%
        ggplot(aes(ratio, fill = book)) +
        geom_histogram(show.legend = FALSE) +
        scale_x_log10() +
        facet_wrap(~ book, ncol = 2)

Inverse Document Frequency and tf-idf

The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the Harry Potter series. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Or put another way, tf-idf helps to find the important words that can provide specific document context. We can easily compute the idf and tf-idf using the bind_tf_idf function provided by the tidytext package.

book_words <- book_words %>%
        bind_tf_idf(word, book, n)

book_words
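
To see what bind_tf_idf adds, here is a minimal sketch on hypothetical counts (`toy_counts`), assuming dplyr and tidytext are installed. To my understanding of the tidytext documentation, the function computes tf as a term's share of the words in its document (n divided by the document total) rather than the log-scaled formula given earlier, and idf as the natural log of the number of documents divided by the number of documents containing the term:

```r
library(dplyr)
library(tidytext)

# hypothetical counts for two books
toy_counts <- tibble(book = c("A", "A", "B", "B"),
                     word = c("harry", "quirrell", "harry", "dobby"),
                     n    = c(3, 1, 3, 1))

# adds the columns tf, idf and tf_idf
toy_tf_idf <- toy_counts %>%
        bind_tf_idf(word, book, n)

toy_tf_idf
```

A word that appears in both books gets an idf of 0, so its tf-idf is 0 regardless of its count.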

We can look at the words that have the highest tf-idf values. Here we see mainly names of characters that are used often in one book but are absent or nearly absent in the other books.

book_words %>%
        dplyr::arrange(dplyr::desc(tf_idf))

To understand the most common contextual words in each book we can take a look at the top 15 terms with the highest tf-idf.

book_words %>%
        dplyr::arrange(dplyr::desc(tf_idf)) %>%
        dplyr::mutate(word = base::factor(word, levels = base::rev(base::unique(word))),
               book = base::factor(book, levels = titles)) %>% 
        dplyr::group_by(book) %>%
        dplyr::top_n(15, wt = tf_idf) %>%
        dplyr::ungroup() %>%
        ggplot(aes(word, tf_idf, fill = book)) +
        geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
        labs(title = "Highest tf-idf words in the Harry Potter series",
             x = NULL, y = "tf-idf") +
        facet_wrap(~book, ncol = 2, scales = "free") +
        coord_flip()

As you can see, most of these high-ranking tf-idf words are nouns that provide specific context around the most common characters in each individual book.